Dependency Downloads

First, get all depencies and packages necessary to complete all tasks.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
if (!requireNamespace("GEOmetadb", quietly = TRUE))
    BiocManager::install("GEOmetadb")
if (!requireNamespace("kableExtra", quietly = TRUE))
    BiocManager::install("kableExtra")
if (!requireNamespace("gridExtra", quietly = TRUE))
    install.packages("gridExtra")
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE))
    BiocManager::install("org.Hs.eg.db")
if (!requireNamespace("knitr", quietly = TRUE))
    install.packages("knitr")
if (!requireNamespace("biomaRt", quietly = TRUE))
    BiocManager::install("biomaRt")
if (!requireNamespace("edgeR", quietly = TRUE))
    BiocManager::install("edgeR")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
    BiocManager::install("ComplexHeatmap")
if (!requireNamespace("circlize", quietly = TRUE))
    BiocManager::install("circlize")
if (!requireNamespace("limma", quietly = TRUE))
    BiocManager::install("limma")
library(edgeR)
library(limma)
library(circlize)
library(ComplexHeatmap)
library(gridExtra)
library(knitr)
library(kableExtra)
library(GEOmetadb)
library(biomaRt)
library(org.Hs.eg.db)

Review of A1

The Expression Dataset I chose was GSE66306: Impact of bariatric surgery on RNA-seq gene expression profiles of peripheral monocytes in humans(Poitou et al. 2015).

About the Dataset

Summary

Genome expression profiles were taken from obese women before, and three months after bariatric surgery.

Control and Test Conditions

The conditions that were tested were:

  • Before bariatric surgery (T0)

  • 3 months after bariatric surgery (T3)

Why I was Interested

I have always been interested in health, and maintaining a healthy lifestyle; I was a competitive gymnast until I came to university. I always knew being healthy (whether that means eating healthily or being active / working out) had amazing benefits on your physical, mental, and emotional health. So a study that focused on health, and how a procedure like bariatric surgery can improve someone’s physical health, was of high interest to me.

Clean the Data and Map to HUGO Symbols

Downloading the Dataset and initial previewing

Now that we have looked into information about the dataset, let’s download my dataset using BiocManager(Morgan 2019) and GEOmetadb(Zhu et al. 2008)!

sfiles = getGEOSuppFiles('GSE66306')
fnames = rownames(sfiles)
PM_exp = read.delim(fnames[1],header=TRUE,
                       check.names = FALSE, stringsAsFactors = FALSE)

Let’s first find out the dimensions of my dataset: 23354, 40

This indicates that there are 23354 rows and 40 columns!

Let’s take a quick look at what the first few rows and columns of my data looks like:

Gene Name Ensembl Gene ID PM01_T0 PM01_T3 PM02_T0 PM02_T3
1/2-SBSRNA4 NA 62.71930 62.29415 66.85663 41.28218
A1BG ENSG00000121410 97.01571 87.47226 67.36065 83.48856
A1BG-AS1 ENSG00000268895 63.30869 69.03770 62.58948 26.72385
A1CF ENSG00000148584 0.00000 0.00000 0.00000 0.00000
A2LD1 NA 174.01275 155.20349 71.78328 90.46922

A quick summary of what I observed about my dataset:

  • There are 23354 genes

  • There are gene names, Ensemble gene IDs, and 12 different test cases (two situations per patient, per gene)

  • Not all genes have Ensembl gene IDs

  • The gene names used are either the HUGO approved symbol or an alias symbol

Organize the dataset into patient IDs and cell types

Before doing further analysis of the dataset, I first want to create a table that lists all patients and easily displays the patient ID as well as the specific cell type analyzed.

patients time
PM01_T0 PM01 T0
PM01_T3 PM01 T3
PM02_T0 PM02 T0
PM02_T3 PM02 T3
PM05_T0 PM05 T0
PM05_T3 PM05 T3
PM06_T0 PM06 T0
PM06_T3 PM06 T3
PM08_T0 PM08 T0
PM08_T3 PM08 T3

Filter weakly expressed features from my dataset

Now, back to my dataset. I want to filter out weakly expressed features, using edgeR:

The filtered dimesions of the dataset now are: 13083, 40.

This means that 10271 genes were removed. That means there were 10271 outliers.

Edit the HUGO gene symbols and Ensembl Gene IDs

As mentioned above, some of the genes are missing Ensembl gene IDs. This is a large issue and I had lots of difficulty trying to salvage as many genes as I could that were missing the Ensembl gene IDs.

First, I tried to separate the genes that were missing ensembl gene IDs from the other genes:

na_gene_ids <- PM_exp_filtered[which(is.na(PM_exp_filtered$`Ensembl Gene ID`)), 1]

There are 1064 genes without ensembl gene ids!

I also read in the paper that they used hg19 instead of the most recent ensembl. Therefore, after some google searching, I came across this article that states that hg19 is equivalent to Ensembl’s GRCh37. As we were shown how to use Ensembl, I went with GRCh37 for all future queries.

Method 1: Match the gene names given in dataset to Ensembl IDs

In my dataset I was given gene ids - some of these were the same as HUGO symbols, while some were aliases or older symbols. I tried to use these to find the associated ensembl gene ids:

grch37 = useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice") # From https://support.bioconductor.org/p/62064/
ensembl_grch37 = useDataset("hsapiens_gene_ensembl",mart=grch37)
is_na <- getBM(attributes = c("wikigene_name", "ensembl_gene_id"),
               filters = c("wikigene_name"),
               values = na_gene_ids,
               mart = ensembl_grch37)

I was fortunate to find 317 of my ensembl ids. I put them back into the dataset by:

for (i in 1:nrow(is_na)) {
  gene_name <- is_na[i,]$wikigene_name
  ensembl_gene_id <- is_na[i,]$ensembl_gene_id
  index <- which(PM_exp_filtered$`Gene Name` == gene_name)
  PM_exp_filtered[index,2] <- ensembl_gene_id
}
still_na <- na_gene_ids[which(!na_gene_ids %in% is_na$wikigene_name)] #remove all now identified gene names

Now, I am missing 808 ensembl ids.

Method 2: Use entrez gene ids on genes that begin with LOC

In my dataset there are quite a few genes that begin with the letters LOC. Dr. Isserlin suggested that if the LOC is removed, these ids can be used as entrez gene ids! I then separated all gene ids that began with LOC and performed a query to use the numbers from the gene ids (that began with LOC) to find matching ensembl gene ids:

LOC_indexes <- grep("^LOC", still_na) # Find all gene names beginning with LOC
LOC_names <- still_na[LOC_indexes]
no_LOC_names <- gsub("^LOC", "", LOC_names) # Remove all of the LOC from every gene name beginning with LOC
length(no_LOC_names) #362
## [1] 362
LOC_grch37 <- getBM(attributes = c("entrezgene_id", "ensembl_gene_id"),
                    filters = c("entrezgene_id"),
                    values = no_LOC_names,
                    mart = ensembl_grch37)

I was able to find 245. Now I will put them back into my dataset by:

for (i in 1:nrow(LOC_grch37)) {
  gene_name <- paste0("LOC", toString(LOC_grch37[i,]$entrezgene_id)) # Add back LOC that they will match with gene names
  ensembl_gene_id <- is_na[i,]$ensembl_gene_id
  index <- which(PM_exp_filtered$`Gene Name` == gene_name)
  PM_exp_filtered[index,2] <- ensembl_gene_id
}
left_LOC_na <- no_LOC_names[which(!no_LOC_names %in% LOC_grch37$entrezgene_id)] # Find all gene names that start with LOC
left_na <- still_na[-LOC_indexes] # Remove all gene names that start with LOC from the <NA> list

I am still left with 446 to attempt to find the ensembl gene ids for. I removed all ids that began with LOC from the list of indices I have left to check as that was the only check that would work in finding ensembl gene ids for genes beginning with LOC.

Method 3: Use list of known aliases to match with dataset gene names

When I was trying to find a solution to my missing ensembl ids, I came across this website and decided to use this as well! I will try and find proper gene names that map to my dataset’s gene names, and use those to find ensembl gene ids.

# To get the list of gene names and aliases
dbCon <- org.Hs.eg_dbconn()
sqlQuery <- 'SELECT * FROM alias, gene_info WHERE alias._id == gene_info._id;'
aliasSymbol <- dbGetQuery(dbCon, sqlQuery)

m <- matrix(ncol=2, byrow=TRUE)
colnames(m) <- c('old_symbol', 'new_symbol') # Old symbol is our gene name, new symbol is matching gene name
all_new_symbols <- c()

for (val in left_na) {
  if (val %in% aliasSymbol$alias_symbol) {
    index <- which(aliasSymbol$alias_symbol == val)
    proper_symbol <- aliasSymbol[index,]$symbol[1]
    m <- rbind(m, c(val, proper_symbol)) #to form association b/w the two
    all_new_symbols <- c(all_new_symbols, proper_symbol) #for next step, to match ensembl gene ids with
  }
}

# Get the ensembl gene ids that map to the new gene names
ensembl_w_new_names <- getBM(attributes = c("wikigene_name", "ensembl_gene_id"),
            filters = c("wikigene_name"),
            values = all_new_symbols,
            mart = ensembl_grch37)


# Now, put all of the ensembl gene IDs into the chart
for (i in 1:nrow(ensembl_w_new_names)) {
  gene_name <- ensembl_w_new_names[i,]$wikigene_name # The new name we matched with our gene names
  old_gene_name <- m[which(m[,2] == gene_name)] # Gene names in our dataset
  ensembl_gene_id <- ensembl_w_new_names[i,]$ensembl_gene_id
  index <- which(PM_exp_filtered$`Gene Name` == old_gene_name)
  PM_exp_filtered[index,2] <- ensembl_gene_id
}
## Warning in PM_exp_filtered$`Gene Name` == old_gene_name: longer object length is
## not a multiple of shorter object length

This is the last method I could find. Even though there are still some genes that are missing ensembl ids, I will leave them in my dataset as they do have some form of identification, though the gene ids used may be aliases or older hugo symbols.

Finally, to actually find the HUGO symbols that map to all of thse ensembl gene ids and add them to the dataset:

# Find the HUGO symbols
all_HUGO <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol"),
                   filters = c("ensembl_gene_id"),
                   values = PM_exp_filtered$`Ensembl Gene ID`,
                   mart = ensembl_grch37)
PM_exp_filtered$"HUGO_symbol" <- NA # Add HUGO column to dataset

# Put hugo symbols into the dataset
for (i in 1:nrow(all_HUGO)) {
  ensembl_num <- all_HUGO[i,]$ensembl_gene_id
  hugo_sym <- all_HUGO[i,]$hgnc_symbol
  index <- which(PM_exp_filtered$`Ensembl Gene ID` == ensembl_num)
  PM_exp_filtered$"HUGO_symbol"[index] <- hugo_sym
}

Now, it is time to check for duplicates!

PM_table <- data.frame(table(PM_exp_filtered$`Ensembl Gene ID`))
all_duplicates <- PM_exp_filtered[PM_exp_filtered$`Ensembl Gene ID` %in% PM_table$Var1[PM_table$Freq > 1],] #check which ensembl ids have a higher frequency than 1, meaning they are duplicated
length(all_duplicates$`Gene Name`) #476
## [1] 476

I can see that my dataset has 476 duplicates! I want to see which of my genes are duplicates:

gene_duplicates <- all_duplicates$`Gene Name`
all_duplicates$`Gene Name`
##   [1] "ACBD6"          "ACPL2"          "AGAP5"          "AGAP6"         
##   [5] "AGAP7"          "AGAP8"          "AGAP9"          "ARMCX5-GPRASP2"
##   [9] "ASB3"           "ATP1A1OS"       "AZI1"           "BMS1P5"        
##  [13] "C10orf118"      "C11orf35"       "C11orf82"       "C12orf23"      
##  [17] "C12orf44"       "C17orf103"      "C17orf72"       "C19orf55"      
##  [21] "C19orf59"       "C1QTNF6"        "C1R"            "C1orf115"      
##  [25] "C1orf162"       "C1orf204"       "C1orf216"       "C1orf233"      
##  [29] "C1orf27"        "C1orf85"        "C20orf112"      "C20orf201"     
##  [33] "C22orf26"       "C2orf62"        "C3orf17"        "C3orf18"       
##  [37] "C3orf37"        "C3orf38"        "C3orf58"        "C3orf62"       
##  [41] "C4orf21"        "C5orf54"        "C8orf37"        "C8orf44"       
##  [45] "C8orf44-SGK3"   "C8orf45"        "C8orf46"        "C8orf58"       
##  [49] "C8orf59"        "C8orf76"        "C8orf80"        "C8orf82"       
##  [53] "CCDC144B"       "CCDC144C"       "CXXC1"          "CXXC5"         
##  [57] "CXorf26"        "CXorf36"        "CXorf56"        "CXorf65"       
##  [61] "CXorf69"        "ECRP"           "EIF4A2"         "FAM105B"       
##  [65] "FAM211A"        "FAM211B"        "FAM21A"         "FAM21B"        
##  [69] "FAM21C"         "FLJ13197"       "FLJ31813"       "FLJ33630"      
##  [73] "FLJ41200"       "FLJ45513"       "GIMAP1-GIMAP5"  "GIMAP5"        
##  [77] "GOLGA6L10"      "GOLGA6L9"       "GPR75-ASB3"     "GTF2H2C"       
##  [81] "GTF2H2D"        "HIST2H4A"       "HIST2H4B"       "HOXA9"         
##  [85] "IL8"            "IPW"            "ITFG2"          "KGFLP2"        
##  [89] "KIAA0947"       "KIAA1009"       "KIAA1737"       "KIAA1804"      
##  [93] "LIMS1"          "LIMS3L"         "LINC00341"      "LOC100009676"  
##  [97] "LOC100128071"   "LOC100128191"   "LOC100128338"   "LOC100128531"  
## [101] "LOC100128682"   "LOC100128788"   "LOC100129196"   "LOC100129250"  
## [105] "LOC100129269"   "LOC100129387"   "LOC100129931"   "LOC100129961"  
## [109] "LOC100130557"   "LOC100130581"   "LOC100130691"   "LOC100130855"  
## [113] "LOC100131067"   "LOC100131089"   "LOC100131096"   "LOC100131193"  
## [117] "LOC100131434"   "LOC100131691"   "LOC100131733"   "LOC100132077"  
## [121] "LOC100132247"   "LOC100132273"   "LOC100132526"   "LOC100132832"  
## [125] "LOC100133445"   "LOC100133612"   "LOC100133991"   "LOC100134229"  
## [129] "LOC100134368"   "LOC100134713"   "LOC100190938"   "LOC100233156"  
## [133] "LOC100233209"   "LOC100268168"   "LOC100270804"   "LOC100271836"  
## [137] "LOC100272216"   "LOC100272228"   "LOC100286844"   "LOC100287015"  
## [141] "LOC100287042"   "LOC100287177"   "LOC100287225"   "LOC100287314"  
## [145] "LOC100287559"   "LOC100287722"   "LOC100287792"   "LOC100288069"  
## [149] "LOC100288123"   "LOC100288198"   "LOC100288432"   "LOC100288637"  
## [153] "LOC100288748"   "LOC100288778"   "LOC100288846"   "LOC100289019"  
## [157] "LOC100289473"   "LOC100289561"   "LOC100293534"   "LOC100294362"  
## [161] "LOC100302401"   "LOC100306951"   "LOC100329109"   "LOC100335030"  
## [165] "LOC100379224"   "LOC100499177"   "LOC100499484"   "LOC100505495"  
## [169] "LOC100505576"   "LOC100505648"   "LOC100505666"   "LOC100505676"  
## [173] "LOC100505678"   "LOC100505702"   "LOC100505715"   "LOC100505761"  
## [177] "LOC100505776"   "LOC100505783"   "LOC100505812"   "LOC100505854"  
## [181] "LOC100505865"   "LOC100505989"   "LOC100506046"   "LOC100506054"  
## [185] "LOC100506083"   "LOC100506085"   "LOC100506100"   "LOC100506123"  
## [189] "LOC100506124"   "LOC100506497"   "LOC100506688"   "LOC100506710"  
## [193] "LOC100506714"   "LOC100506746"   "LOC100506844"   "LOC100506866"  
## [197] "LOC100506930"   "LOC100506963"   "LOC100506994"   "LOC100507032"  
## [201] "LOC100507034"   "LOC100507062"   "LOC100507217"   "LOC100507246"  
## [205] "LOC100507373"   "LOC100507392"   "LOC100507404"   "LOC100507423"  
## [209] "LOC100507424"   "LOC100507463"   "LOC100507495"   "LOC100507501"  
## [213] "LOC100507557"   "LOC100507567"   "LOC100507577"   "LOC100507632"  
## [217] "LOC100509894"   "LOC100527964"   "LOC100616668"   "LOC100630923"  
## [221] "LOC100652999"   "LOC100861402"   "LOC144571"      "LOC145783"     
## [225] "LOC145820"      "LOC147670"      "LOC147727"      "LOC147804"     
## [229] "LOC148189"      "LOC149837"      "LOC150776"      "LOC151174"     
## [233] "LOC151475"      "LOC151534"      "LOC152217"      "LOC152225"     
## [237] "LOC153684"      "LOC154761"      "LOC155060"      "LOC158257"     
## [241] "LOC158572"      "LOC200772"      "LOC202181"      "LOC202781"     
## [245] "LOC219347"      "LOC220729"      "LOC220906"      "LOC221442"     
## [249] "LOC253039"      "LOC254100"      "LOC254128"      "LOC255512"     
## [253] "LOC257396"      "LOC283089"      "LOC283143"      "LOC283299"     
## [257] "LOC283624"      "LOC283663"      "LOC283693"      "LOC283888"     
## [261] "LOC284023"      "LOC284385"      "LOC284440"      "LOC284454"     
## [265] "LOC284551"      "LOC284751"      "LOC284837"      "LOC284865"     
## [269] "LOC284950"      "LOC285033"      "LOC285074"      "LOC285103"     
## [273] "LOC285359"      "LOC285540"      "LOC285696"      "LOC285954"     
## [277] "LOC285965"      "LOC285972"      "LOC286059"      "LOC286186"     
## [281] "LOC286467"      "LOC338758"      "LOC338799"      "LOC338817"     
## [285] "LOC339166"      "LOC339290"      "LOC339524"      "LOC339803"     
## [289] "LOC339874"      "LOC340037"      "LOC340515"      "LOC340544"     
## [293] "LOC349196"      "LOC374443"      "LOC375190"      "LOC386597"     
## [297] "LOC387646"      "LOC387647"      "LOC388387"      "LOC389634"     
## [301] "LOC389641"      "LOC400027"      "LOC400604"      "LOC400657"     
## [305] "LOC400891"      "LOC400958"      "LOC401074"      "LOC401093"     
## [309] "LOC401320"      "LOC401397"      "LOC401588"      "LOC439994"     
## [313] "LOC440173"      "LOC440288"      "LOC440300"      "LOC440600"     
## [317] "LOC440944"      "LOC440970"      "LOC441666"      "LOC442028"     
## [321] "LOC541471"      "LOC541473"      "LOC550112"      "LOC550643"     
## [325] "LOC595101"      "LOC613037"      "LOC619207"      "LOC641298"     
## [329] "LOC641367"      "LOC642361"      "LOC642852"      "LOC643406"     
## [333] "LOC643529"      "LOC643623"      "LOC643733"      "LOC643770"     
## [337] "LOC643837"      "LOC644172"      "LOC644242"      "LOC644714"     
## [341] "LOC645166"      "LOC645212"      "LOC645513"      "LOC645676"     
## [345] "LOC645752"      "LOC646214"      "LOC646329"      "LOC646762"     
## [349] "LOC646851"      "LOC648740"      "LOC648987"      "LOC652276"     
## [353] "LOC653061"      "LOC653160"      "LOC653501"      "LOC653712"     
## [357] "LOC678655"      "LOC728024"      "LOC728190"      "LOC728431"     
## [361] "LOC728537"      "LOC728554"      "LOC728558"      "LOC728606"     
## [365] "LOC728730"      "LOC728752"      "LOC728855"      "LOC729678"     
## [369] "LOC729737"      "LOC729799"      "LOC730091"      "LOC730102"     
## [373] "LOC730227"      "LOC731424"      "LOC731789"      "LOC92249"      
## [377] "LOC93622"       "LOC96610"       "LONP2"          "LSMD1"         
## [381] "MATR3"          "MGC21881"       "MGC57346"       "MIR1248"       
## [385] "MIR3653"        "MRP63"          "MST4"           "NAA38"         
## [389] "NBPF11"         "NBPF16"         "NBPF24"         "NS3BP"         
## [393] "OSTC"           "OSTCP1"         "PAR-SN"         "PARPBP"        
## [397] "PARS2"          "PIDD"           "PLSCR3"         "PMS2P5"        
## [401] "PRH1-PRR4"      "PRKRIP1"        "PROSC"          "PROSER1"       
## [405] "PSIP1"          "PSKH1"          "SGK3"           "SNHG4"         
## [409] "SPDYE2"         "SPDYE5"         "TARS"           "TARS2"         
## [413] "THADA"          "THAP1"          "TMED1"          "TMED10"        
## [417] "TMEM184A"       "TMEM184B"       "TMEM5"          "TMEM50A"       
## [421] "TMEM86A"        "TMEM86B"        "TNFAIP8L3"      "TNFRSF10A"     
## [425] "TREML4"         "TRERF1"         "TRIM8"          "TRIM9"         
## [429] "TSG101"         "TSGA10"         "TXN"            "TXN2"          
## [433] "UBE2O"          "UBE2Q1"         "UGCG"           "UGDH"          
## [437] "UQCR10"         "UQCR11"         "UQCRQ"          "URB1"          
## [441] "WASH2P"         "WASH3P"         "WDR7"           "WDR70"         
## [445] "ZFP14"          "ZFP161"         "ZHX2"           "ZHX3"          
## [449] "ZNF169"         "ZNF17"          "ZNF189"         "ZNF19"         
## [453] "ZNF195"         "ZNF197"         "ZNF253"         "ZNF254"        
## [457] "ZNF26"          "ZNF260"         "ZNF281"         "ZNF436"        
## [461] "ZNF438"         "ZNF500"         "ZNF501"         "ZNF543"        
## [465] "ZNF544"         "ZNF671"         "ZNF672"         "ZNF674"        
## [469] "ZNF675"         "ZNF736"         "ZNF737"         "ZNF768"        
## [473] "ZNF77"          "ZNHIT1"         "ZNHIT2"         "ZNRF3"

If I remember correctly, the gene names listed here look an awful lot like the list of gene names who origianlly were missing ensemble gene ids! Let’s check:

orig_missing_ensembl <- length(which(gene_duplicates %in% PM_exp$`Gene Name`[is.na(PM_exp$`Ensembl Gene ID`)]))
length(which(gene_duplicates %in% PM_exp$`Gene Name`[is.na(PM_exp$`Ensembl Gene ID`)])) #406
## [1] 406

Wow! Most of the genes that are duplicates originally had no ensembl gene ids! As these duplicates make up around 3% of my dataset, I am going to leave all of these values in. I don’t feel comfortable removing genes, especially when I am unsure of the fact that the genes that are duplicated are being mapped 100% correctly.

Normalize the Data

Before perfoming any normalization on my dataset, I just wanted to be able to visualize my data.

I chose to use a boxplot because I found it to be the easiest representation to view the data as it showed distributions of each sample’s (PM_#) values and lots of information about them in one plot (interquartile range, first and third quartiles, and outliers).

I also used a denstiy plot as it is similar to a histogram, but you are able to easily view the distribution of data over a continuous interval of patient’s expression of the genes.

Now that I have been able to get an overview of what my data looks like, I will normalize the data:

filtered_data_matrix <- as.matrix(PM_exp_filtered[,3:40])
rownames(filtered_data_matrix) <- PM_exp_filtered$`Ensembl Gene ID`
d = DGEList(counts=filtered_data_matrix, group=samples$time)
d = calcNormFactors(d)
normalized_counts <- cpm(d)

A few of the normalized factors can be displayed:

group lib.size norm.factors
PM01_T0 T0 33572965 0.8029624
PM01_T3 T3 32940023 0.9024528
PM02_T0 T0 28698158 1.0369125
PM02_T3 T3 28868780 1.0366107
PM05_T0 T0 24130290 0.9819000

We can see that there will be minor modifications to the dataset, but these modifications will still have a slight impact (as seen from the norm.factors column).

From this plot, I can automatically see that all patients, before and after surgery had very similar interquartile ranges, with a mean around 4. There seemed to be quite a few outliers, many on the more negative side, indicating much lower expression occurred slightly more frequently than very high expression.

The differences between pre-normalization and post-normalization are very minimal, especially in this graph. The different lines indiciating different patients are tigher together, however the mean has not shifted much. It seems that most pateints gene expression hovers around 0.18, with PM02_T3 dipping slightly lower at 0.15.

Now we have caught up with how the dataset was normalized and cleaned, and have a copy of it for further analysis!

kable(PM_exp_filtered[1:10,1:10], format = "html")
Gene Name Ensembl Gene ID PM01_T0 PM01_T3 PM02_T0 PM02_T3 PM05_T0 PM05_T3 PM06_T0 PM06_T3
1 1/2-SBSRNA4 ENSG00000247950 62.71930 62.29415 66.85663 41.28218 50.56757 22.50039 33.88546 25.70598
2 A1BG ENSG00000121410 97.01571 87.47226 67.36065 83.48856 186.39012 165.66641 130.99126 136.03277
3 A1BG-AS1 ENSG00000268895 63.30869 69.03770 62.58948 26.72385 118.38336 118.66857 64.01149 68.88014
5 A2LD1 ENSG00000134864 174.01275 155.20349 71.78328 90.46922 118.88036 117.50502 102.13343 111.10433
6 A2M ENSG00000175899 19.36841 16.66238 37.17075 18.27523 15.14377 37.73556 32.69034 26.84309
13 AAAS ENSG00000094914 595.53115 596.03090 494.07056 567.92197 476.07756 473.95374 488.67144 430.47829
14 AACS ENSG00000081760 174.93491 220.07247 213.37203 195.32551 218.75090 229.04913 174.27876 224.90844
21 AAGAB ENSG00000103591 932.06316 939.71988 994.14134 1006.52816 714.96474 747.77940 838.74404 723.04105
22 AAK1 ENSG00000115977 1318.43352 2106.26480 2545.63080 2495.16544 1405.69820 1428.93760 1780.92956 1493.12287
23 AAMP ENSG00000127837 2110.10878 2056.08450 1639.15665 1687.72253 1380.47480 1649.38669 1505.34146 1429.78255

Differential Gene Expression

Now I will begin analysis of gene expression from my dataset.

Clustering of Dataset

Before performing any analysis, I will be begin with plotting an MDS graph to analyze the culstering between patients before the surgery and three hours after the surgery.

# Make certain modifications to the dataset to remove issues of errors later in the report
#Only incude the genes whose ensembl gene ids are not NULL
PM_exp_filtered <- PM_exp_filtered[-which(is.na(PM_exp_filtered$`Ensembl Gene ID`)),]
#Get rid of all duplicates as well
PM_exp_filtered <- PM_exp_filtered[-which(PM_exp_filtered$`Gene Name` %in% all_duplicates$`Gene Name`),]

heatmap_matrix <- PM_exp_filtered[,4:ncol(PM_exp_filtered)-1]
rownames(heatmap_matrix) <- PM_exp_filtered$`Ensembl Gene ID`
colnames(heatmap_matrix) <- colnames(PM_exp_filtered[,4:ncol(PM_exp_filtered)-1])

plotMDS(heatmap_matrix, col = rep(c("darkgreen","blue"),10))

As shown in the graph, there is no two clusterings of patients before (T0) surgery and after (T3) surgery. It seems that most patients cluster just after 0 on the x-axis, for both results of before or after surgery. This indicates that my dataset is not the most ideal dataset, as in an ideal dataset the two treatment cases (T0 and T3) are in two very separate clusters. This may just add noise to my results later on in this process.

To have a further look at my dataset, I will plot an MDS graph and analyze the clustering between each individual patient before and after surgery.

pat_colors <- rainbow(10)
pat_colors <- unlist(lapply(pat_colors,FUN=function(x){rep(x,2)}))
plotMDS(heatmap_matrix, col = pat_colors )

As shown in the graph, there is a slight clustering within patients before (T0) and after surgery (T3). This is not ideal within a dataset, since patients should not cluster around eachother in the treatment and control cases (T0 and T3). Again, this indicates that noise will be present and may be an issue later on in the gene expression analysis.

Now, I will begin gene expression analysis of my dataset.

Analysis Using Limma

First, I will create a linear model using Limma(Ritchie et al. 2015).

model_design <- model.matrix(~ samples$time)
kable(model_design, type="html")
(Intercept) samples$timeT3
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1
1 0
1 1

Next, I will fit my dataset to the linear model by applying empirical Bayes, which uses my data to specify the baseline, or set a prior observation. Then, I will create a table in which the p-values and adjusted p-values for gene expression are calculated.

expressionMatrix <- as.matrix(PM_exp_filtered[,3:40])
rownames(expressionMatrix) <- PM_exp_filtered$`Ensembl Gene ID`
colnames(expressionMatrix) <- colnames(PM_exp_filtered)[3:40]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
fit <- lmFit(minimalSet, model_design)

fit2 <- eBayes(fit,trend=TRUE)
topfit <- topTable(fit2, 
                   coef=ncol(model_design),
                   adjust.method = "BH",
                   number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(PM_exp_filtered[,c(2,41)],
                     topfit,
                     by.y=0,by.x=1,
                     all.y=TRUE)
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]

#view hits
kable(output_hits[1:10,],type="html")
Ensembl Gene ID HUGO_symbol logFC AveExpr t P.Value adj.P.Val B
10300 ENSG00000196935 SRGAP1 20.69417 45.12257 3.860254 0.0004222 0.9999188 -4.595103
6417 ENSG00000153283 CD96 25.67070 51.88529 3.720099 0.0006360 0.9999188 -4.595104
11960 ENSG00000269821 KCNQ1OT1 60.93707 186.82364 3.338608 0.0018822 0.9999188 -4.595107
276 ENSG00000020426 MNAT1 -20.75998 78.76815 -3.305797 0.0020617 0.9999188 -4.595107
11638 ENSG00000248905 FMN1 154.66557 441.55675 3.243284 0.0024497 0.9999188 -4.595107
6728 ENSG00000157933 SKI 302.04315 2099.74290 3.191329 0.0028240 0.9999188 -4.595108
2544 ENSG00000109790 KLHL5 -298.23446 2083.81718 -3.188445 0.0028463 0.9999188 -4.595108
318 ENSG00000026652 AGPAT4 218.70614 977.10470 3.143107 0.0032195 0.9999188 -4.595108
8274 ENSG00000170264 FAM161A 25.32429 59.14808 3.082983 0.0037863 0.9999188 -4.595109
9815 ENSG00000185862 EVI2B -4826.44290 22858.95871 -3.058255 0.0040457 0.9999188 -4.595109

Next, I want to see how many of my genes had significant enough expression before, and after adjustment.

#How many gene pass the threshold p-value < 0.05?
length(which(output_hits$P.Value < 0.05))
## [1] 223
#How many genes pass correction?
length(which(output_hits$adj.P.Val < 0.05))
## [1] 0

It seems that there are 223 genes that are significant before adjustment, but none after. This indicates that futher processing must be completed to get p-values that have been adjusted that are still significant.

Multiple Hypothesis Testing

A method that can be used to improve the results of the adjusted p-values is multiple hypothesis testing! This helps to control for patient variability, by taking both the patients’ gene expression AND the number of patients into account, instead of just the patients’ gene expression.

First, the linear model needs to be updated so that it now takes into consideration the patients’ gene expression and the number of patients.

model_design_pat <- model.matrix(~ samples$patients + samples$time)
model_design_pat[1:10,1:5]
##    (Intercept) samples$patientsPM02 samples$patientsPM05 samples$patientsPM06
## 1            1                    0                    0                    0
## 2            1                    0                    0                    0
## 3            1                    1                    0                    0
## 4            1                    1                    0                    0
## 5            1                    0                    1                    0
## 6            1                    0                    1                    0
## 7            1                    0                    0                    1
## 8            1                    0                    0                    1
## 9            1                    0                    0                    0
## 10           1                    0                    0                    0
##    samples$patientsPM08
## 1                     0
## 2                     0
## 3                     0
## 4                     0
## 5                     0
## 6                     0
## 7                     0
## 8                     0
## 9                     1
## 10                    1

Then, we again fit the model using empirical Bayes.

fit_pat <- lmFit(minimalSet, model_design_pat)
fit2_pat <- eBayes(fit_pat,trend=TRUE)
topfit_pat <- topTable(fit2_pat, 
                   coef=ncol(model_design_pat),
                   adjust.method = "BH",
                   number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits_pat <- merge(PM_exp_filtered[,c(2,41)],
                         topfit_pat,by.y=0,by.x=1,all.y=TRUE)
#sort by pvalue
output_hits_pat <- output_hits_pat[order(output_hits_pat$P.Value),]
kable(output_hits_pat[1:10,],type="html")
Ensembl Gene ID HUGO_symbol logFC AveExpr t P.Value adj.P.Val B
6152 ENSG00000149428 HYOU1 224.10735 1584.97806 5.214442 0.0000409 0.2939791 -4.531650
10178 ENSG00000196218 RYR1 18.61091 128.37796 4.784806 0.0001103 0.2939791 -4.536787
4801 ENSG00000135540 NHSL1 26.21227 101.38153 4.722608 0.0001274 0.2939791 -4.537575
8501 ENSG00000172037 LAMB2 74.16278 366.66467 4.718775 0.0001286 0.2939791 -4.537624
8102 ENSG00000168890 TMEM150A 107.76304 528.90158 4.701590 0.0001338 0.2939791 -4.537844
11546 ENSG00000242732 RGAG4 66.27998 313.35280 4.667872 0.0001448 0.2939791 -4.538278
11960 ENSG00000269821 KCNQ1OT1 60.93707 186.82364 4.608272 0.0001664 0.2939791 -4.539054
276 ENSG00000020426 MNAT1 -20.75998 78.76815 -4.490611 0.0002191 0.3023822 -4.540616
11254 ENSG00000227460, ENSG00000197283 NA 46.40860 282.45028 4.488754 0.0002201 0.3023822 -4.540641
3401 ENSG00000119574 ZBTB45 -39.56290 135.70651 -4.418770 0.0002593 0.3078740 -4.541590

Next, we look at the p-values again like we did previously.

#How many gene pass the threshold p-value < 0.05?
length(which(output_hits_pat$P.Value < 0.05))
## [1] 933
#How many genes pass correction?
length(which(output_hits_pat$adj.P.Val < 0.05))
## [1] 0

Unfortunately, the results are again slightly disappointing in that there are still no genes with adjusted p-values that have significant gene expression.

Analysis Using EdgeR

When I was using Limma, I could not get any significantly differentially expressed genes, before or after multiple hypothesis testing.

I then re-read my paper and noticed that they also performed gene differential analysis, but used a different package than Limma, as the data seemed to follow a negative binomal distribution. Though Limma works well on data that follows a linear distribution, it should not be used by my data as my data follows a negative binomial distribution instead. EdgeR(MD, DJ, and GK 2010) is another package that can be used for differential expression, and is great for data that follows a negative binomial distribution - just like mine!

I will confirm that my data follows a negative binomial distribution by plotting the mean-variance graph of my data:

filtered_data_matrix <- as.matrix(PM_exp_filtered[,3:40])
rownames(filtered_data_matrix) <- PM_exp_filtered$`Ensembl Gene ID`
d = DGEList(counts=filtered_data_matrix, group=samples$time)
d <- estimateDisp(d, model_design_pat)
plotMeanVar(d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,
            show.ave.raw.vars = TRUE,
            NBline=TRUE,
            show.binned.common.disp.vars = TRUE)

The blue line in the above graph is the negative binomial line. As you can see by the red X’s, my data follows the negative binomial distribution. Therefore, I can use the edgeR package on my dataset!

First, I will use the edgeR glmQLFTest to fit my dataset. I will fit my dataset using the multiple hypothesis test like I did with Limma, where I used both the number of patients and the pateints’ gene expression in an attempt to control for the patients’ variability.

#model_design_pat already is set to use both the number of patients and the patients' data
fit <- glmQLFit(d, model_design_pat) 
qlf.pos_vs_neg <- glmQLFTest(fit, coef='samples$timeT3')
kable(topTags(qlf.pos_vs_neg), type="html")
logFC logCPM F PValue FDR
ENSG00000199753 15.5436097 6.757123 4856.41928 0.0e+00 0.0005724
ENSG00000207241 13.9573395 5.272170 2510.36283 2.0e-07 0.0012801
ENSG00000264553 -12.6905808 4.001549 1736.68486 5.0e-07 0.0019685
ENSG00000207741 10.7899386 2.145184 929.80788 2.0e-06 0.0049516
ENSG00000172037 0.3888091 3.769320 40.14231 2.0e-06 0.0049516
ENSG00000149428 0.3047276 5.881178 38.12662 2.9e-06 0.0054999
ENSG00000157933 0.3141314 6.296814 37.78702 3.1e-06 0.0054999
ENSG00000242732 0.4078300 3.557174 35.46713 4.9e-06 0.0075674
ENSG00000004478 0.3380862 4.902422 34.79577 5.6e-06 0.0076950
ENSG00000168890 0.4101843 4.306209 32.39425 9.2e-06 0.0113602
x
BH
x
samples$timeT3
x
glm

Next, I will check to see if there are any genes with significant gene expression using this new model.

qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue", n = nrow(PM_exp_filtered))
length(which(qlf_output_hits$table$PValue < 0.05))
## [1] 2911
length(which(qlf_output_hits$table$FDR < 0.05)) #FDR is adjusted p-values
## [1] 138

Fortunately, there are 138 genes with significant gene expression, even after adjustment!

Next, I will plot the data on a heatmap(Gu, Eils, and Schlesner 2016), and look at the differences between patients before surgery (T0) and after surgery (T3).

top_hits <- rownames(qlf_output_hits$table)[qlf_output_hits$table$FDR<0.05] 
heatmap_matrix_tophits <- t(
  scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),]))) 
heatmap_matrix_tophits<- heatmap_matrix_tophits[,
       c(grep(colnames(heatmap_matrix_tophits),pattern = "T0"),                            grep(colnames(heatmap_matrix_tophits),pattern = "T3"))]
if(min(heatmap_matrix_tophits) == 0){
    heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)), 
                             c( "white", "red"))
  } else {
    heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
  }
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = FALSE,
                               show_row_dend = TRUE,
                               show_column_dend = FALSE,
                               col=heatmap_col,
                               show_column_names = TRUE, 
                               show_row_names = FALSE,
                               show_heatmap_legend = TRUE,
                               )
current_heatmap

The heatmap indicates that there is no or very little clustering by conditions. The expression levels vary in both conditions to a large degree, though there seems to be slightly less gene expression before surgery (T0), indicated by a purple colour and there is slightly more gene expression after surgery (T3), indicated by a red colour.

Genes of Interest

Finally, I will use an MA plot to look at genes of interest in my dataset:

First, I will use a volcano plot to plot the upregulated genes; they can be seen as the orange circles on the graph.

plotVolcano <- cbind(qlf_output_hits$table$logFC, -log10(qlf_output_hits$table$FDR))
colnames(plotVolcano) <- c("logFC", "P-val")
upregulated <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC > 1
downregulated <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC < 0
point.col <- ifelse(upregulated, "orange", "black")
plot(plotVolcano, col = point.col)

Next, I will use a volcano plot to plot the downregulated genes; they can be seen as green circles on the graph.

point.col <- ifelse(downregulated, "green", "black")
plot(plotVolcano, col = point.col)

Thresholding

Finding Upregulated and Downregulated Genes

I am going to continue to use edgeR as it allowed me to find genes that passed correction. With edgeR, I will be finding the downregulated and upregulated genes from my dataset.

d <- calcNormFactors(d)
fit <- glmQLFit(d, model_design_pat)
qlf.pos_vs_neg <- glmQLFTest(fit, coef='samples$timeT3')
qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue",
                           n = nrow(filtered_data_matrix))

length(which(qlf_output_hits$table$PValue < 0.05 & qlf_output_hits$table$logFC > 0))
## [1] 732
length(which(qlf_output_hits$table$PValue < 0.05 & qlf_output_hits$table$logFC < 0))
## [1] 709

I was able to find that there are 732 upregulated genes, and 709 downregulated genes! Now I will find the ensembl gene IDs of these genes and save them in text files, which I will put into the g:profiler(Raudvere et al. 2019) webpage so that I can see what datasets match those genes best.

qlf_output_hits_withgn <- merge(PM_exp_filtered[,1:2],qlf_output_hits, by.x=2, by.y = 0)
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]

#Get all of the gene sets
upregulated_genes <- qlf_output_hits_withgn$`Ensembl Gene ID`[which(qlf_output_hits_withgn$PValue < 0.05 
             & qlf_output_hits_withgn$logFC > 0)]
downregulated_genes <- qlf_output_hits_withgn$`Ensembl Gene ID`[which(qlf_output_hits_withgn$PValue < 0.05 
             & qlf_output_hits_withgn$logFC < 0)]
all_genes <- c(downregulated_genes, upregulated_genes)

#Save all gene sets to text files to be able to use in g:profiler
write.table(x=upregulated_genes,
            file=file.path("data","PM_upregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
            file=file.path("data","PM_downregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=all_genes,
            file=file.path("data","PM_all_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

Now that I have text files with all of the necessary genes, I will use these text files in g:profiler. The results can be found in images below.

Annotation Data Used

The annotation data sets that I used were GO biological process, Reactome, and WikiPathways. There were no results for WikiPathways, so this annotation data set results will not be shown below. I used these annotation data sets because I am familiar with how they all work from previous assignments, as well as they all have important strengths that can be leveraged. Reactome is useful in identifying full molecular details of a pathway, while GO is useful to learn more about the outcome or ending state of molecular functions. I am using version e98_eg45_p14_ce5b097 of the annotation data sets.

Number of Genesets Returned with Chosen Thresholds

I used the same thresholds from when I performed gene expression analysis; again I will be looking at FDR (Benjamini-Hochberg method) that are below 0.05, as this is the usual threshold that is set for other quantifiers such as p-values.

The amount of genesets returned from each annotation data set for each query performed are shown below.

Up-Regulation Results

  • Go biological pathways: 23 genesets

  • Reactome: 53 genesets

Down-Regulation Results

  • GO biological pathways: 91 genesets

  • Reactome: 3 genesets

Combined Up-Regulation and Down-Regulation Results

  • GO biological pathways: 101 genesets

  • Reactome: 14 genesets

G:Profiler Results Seperately vs. Together

Results for up-regulation are as follows:

The top term genesets returned have to do with response to heat and transporting mRNA.

Results for down-regulation are as follows:

The top term genesets returned have to do with building and breaking down cells, and actions of blood cells (oxygenating, deoxygenating, degranulation).

Results for both down-regulation and up-regulation are as follows:

The top term genesets returned have to do with SUMOylation of different proteins, and metabolic and catabolic processes.

Comparing GO results from all

It seems that all the results for GO invovled some form of regulation or organization of different parts or aspects of the cell. It seems that for the results where both down-regulation and up-regulation were included, the top term genesets returned were a combination of the results from down-regulation and up-regulation, as to be expected when combining the genes together.

Comparing Reactome results from all

It seems that the results for Reactome were again, a combination of both genesets returned from down-regulation and up-regulation. However, as opposed to GO, instead of having an almost equal spread of top term genesets between down and up-regulation, most of the top term genesets were from the up-regulation rather than the down-regulation.

Interpretation

Over-Representation Results vs. Conclusions of Paper

In the paper I chose, they also performed gene expression analysis and thersholding, so I could compare my results with theirs.

The paper specifies that the upregulated genes were “mostly involved in the basal cellular machinery.” Basal cellular machinery has to do with forming the basic building blocks of the cell. This aligns with my results, as my results have to do with transporting mRNA, which is one of the basic building blocks to create cells.

The paper specifies that the down-regulated genes were “enriched in metabolic functions of adipose tissue”. This corresponds well with my results, since the results I found were genesets that broke and created cells, which are metabolic functions.

Other Papers Supporting Results

There are other papers who perform research to observe the effects after bariatric surgery on another’s body.

Another article(Santos et al. 2014) I found performed research on different aspects of patients’ physical and cellular heath before and three months after bariastric surgery. The results that they found were that the patients’ weight went down, and there were decreased neutrophil and triglyceride counts. The results that I found result were similar to the results of this paper as the genesets that were downregulated were actions of blood cells (like nutrophils), as well as breaking down cells like triglycerides. Also, my results indicate that response to heat is up-regulated, leading me to believe that the increased response to heat causes the patients to lose weight, just as the paper described.

There was also an article(Sparks et al. 2015) that reported findings on arthritic patients’ health before and after bariatric surgery. The results were similar to the article above, where the patients again lost weight, but in addition, in this article it was documented that there was a lower erythrocyte sedimentation rate. This again matches with my results, as a down-regulated gene was oxygentation and deoxygenation of blood cells, which means that erythrocyte sedimentation rate will decrease as a result since the erythrocytes ability to fall quickly.

References

Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics.

MD, Robinson, McCarthy DJ, and Smyth GK. 2010. “edgeR a Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (7): 139–40.

Morgan, Martin. 2019. BiocManager: Access the Bioconductor Project Package Repository. https://CRAN.R-project.org/package=BiocManager.

Poitou, Christine, Claire Perret, François Mathieu, Vinh Truong, Yuna Blum, Hervé Durand, Rohia Alili, et al. 2015. “Bariatric Surgery Induces Disruption in Inflammatory Signaling Pathways Mediated by Immune Cells in Adipose Tissue: A Rna-Seq Study.” PLoS One 10 (5). Public Library of Science.

Raudvere, Uku, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, and Jaak Vilo. 2019. “g:Profiler a Web Server for Functional Enrichment Analysis and Conversions of Gene Lists (2019 Update.” Nucleic Acids Research. https://doi.org/10.1093/nar/gkz369.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Santos, J, P Salgado, C Santos, P Mendes, J Saavedra, P Baldaque, L Monteiro, and E Costa. 2014. “Effect of Bariatric Surgery on Weight Loss, Inflammation, Iron Metabolism, and Lipid Profile.” Scandinavian Journal of Surgery 103 (1). SAGE Publications Sage UK: London, England: 21–25.

Sparks, Jeffrey A, Florencia Halperin, Jonathan C Karlson, Elizabeth W Karlson, and Bonnie L Bermas. 2015. “Impact of Bariatric Surgery on Patients with Rheumatoid Arthritis.” Arthritis Care & Research 67 (12). Wiley Online Library: 1619–26.

Zhu, Yuelin, Sean Davis, Robert Stephens, Paul S. Meltzer, and Yidong Chen. 2008. “GEOmetadb: Powerful Alternative Search Engine for the Gene Expression Omnibus.” Bioinformatics (Oxford, England) 24 (23): 2798–2800. https://doi.org/10.1093/bioinformatics/btn520.